Unsupervised discovery of morphologically related words based on 
orthographic and semantic similarity 
Abstract 

We present an algorithm that takes an 
unannotated corpus as its input, and re- 
turns a ranked list of probable morpho- 
logically related pairs as its output. The 
algorithm tries to discover morphologi- 
cally related pairs by looking for pairs 
that are both orthographically and seman- 
tically similar, where orthographic simi- 
larity is measured in terms of minimum 
edit distance, and semantic similarity is 
measured in terms of mutual information. 
The procedure does not rely on a mor- 
pheme concatenation model, nor on dis- 
tributional properties of word substrings 
(such as affix frequency). Experiments 
with German and English input give en- 
couraging results, both in terms of pre- 
cision (proportion of good pairs found at 
various cutoff points of the ranked list), 
and in terms of a qualitative analysis of 
the types of morphological patterns dis- 
covered by the algorithm. 

1 Introduction 

In recent years, there has been much interest in com- 
putational models that learn aspects of the morphol- 
ogy of a natural language from raw or structured 
data. Such models are of great practical interest as 
tools for descriptive linguistic analysis and for mini- 
mizing the expert resources needed to develop mor- 
phological analyzers and stemmers. From a theo- 
retical point of view, morphological learning algo- 
rithms can help answer questions related to human 
language acquisition. 

In this study, we present a system that, given a 
corpus of raw text from a language, returns a ranked 
list of probable morphologically related word pairs. 
For example, when run with the Brown corpus as 
its input, our system returned a list with pairs such 
as pencil/pencils and structured/unstructured at the 
top. 

Our algorithm is completely knowledge-free, in 
the sense that it processes raw corpus data, and it 
does not require any form of a priori information 
about the language it is applied to. The algorithm 
performs unsupervised learning, in the sense that it 
does not require a correctly-coded standard to (iter- 
atively) compare its output against. 

The algorithm is based on the simple idea that a 
combination of formal and semantic cues should be 
exploited to identify morphologically related pairs. 
In particular, we use minimum edit distance to mea- 
sure orthographic similarity, 1 and mutual informa- 
tion to measure semantic similarity. The algo- 
rithm does not rely on the notion of affix, and it 
does not depend on global distributional properties 
of substrings (such as affix frequency). Thus, at 
least in principle, the algorithm is well-suited to 
discover pairs that are related by rare and/or non- 
concatenative morphological processes. 

The algorithm returns a list of related pairs, but 
it does not attempt to extract the patterns that relate 
the pairs. As such, it can be used as a tool to pre- 



1 Given phonetically transcribed input, our model would 
compute phonetic similarity instead of orthographic similarity. 



process corpus data for an analysis to be performed 
by a human morphologist, or as the first step of a 
fully automated morphological learning program, to 
be followed, for example, by a rule induction pro- 
cedure that extracts correspondence patterns from 
paired forms. See the last section of this paper for 
further discussion of possible applications. 

We tested our model with German and English 
input. Our results indicate that the algorithm is 
able to identify a number of pairs related by a va- 
riety of derivational and inflectional processes with 
a remarkably high precision rate. The algorithm 
also discovers morphological relationships (such 
as German plural formation with umlaut) that would 
probably be harder to discover using affix-based ap- 
proaches. 

The remainder of the paper is organized as fol- 
lows: In section 2, we briefly review related work. 
In section 3, we present our model. In section 4, we 
discuss the results of experiments with German and 
English input. Finally, in section 5 we summarize 
our main results, we sketch possible directions that 
our current work could take, and we discuss some 
potential uses for the output of our algorithm. 

2 Related work 

For reasons of space, we discuss here only three ap- 
proaches that are closely related to ours. See, for 
example, Goldsmith (2001) for a very different (pos- 
sibly complementary) approach, and for a review of 
other relevant work. 

2.1 Jacquemin (1997) 

Jacquemin (1997) presents a model that automati- 
cally extracts morphologically related forms from a 
list of English two-word medical terms and a corpus 
from the medical domain. 

The algorithm looks for correspondences between 
two-word terms and orthographically similar pairs 
of words that are adjacent in the corpus. For exam- 
ple, the list contains the term artificial ventilation, 
and the corpus contains the phrase artificially ven- 
tilated. Jacquemin's algorithm thus postulates the 
(paired) morphological analyses artificial ventilat- 
ion and artificial-ly ventilat-ed. 

Similar words, for the purposes of this pairing 
procedure, are simply words that share a common 
left substring (with constraints that we do not dis- 
cuss here). 

Jacquemin's procedure then builds upon these 
early steps by clustering together sets that follow the 
same patterns, and using these larger classes to look 
for spurious analyses. Finally, the algorithm tries to 
cluster classes that are related by similar, rather than 
identical, suffixation patterns. Again, we will not 
describe here how this is accomplished. 

Our basic idea is related to that of Jacquemin, but 
we propose an approach that is more general both 
in terms of orthography and in terms of semantics. 
In terms of orthography, we do not require that two 
strings share the left (or right) substring in order to 
constitute a candidate pair. Thus, we are not limited 
to affixal morphological patterns. Moreover, our al- 
gorithm extracts semantic information directly from 
the input corpus, and thus it does not require a pre- 
compiled list of semantically related pairs. 

2.2 Schone and Jurafsky (2000) 

Schone and Jurafsky (2000) present a knowledge- 
free unsupervised model in which orthography- 
based distributional cues are combined with seman- 
tic information automatically extracted from word 
co-occurrence patterns in the input corpus. 

They first look for potential suffixes by search- 
ing for frequent word-final substrings. Then, they 
look for potentially morphologically related pairs, 
i.e., pairs that end in potential suffixes and share the 
left substring preceding those suffixes. Finally, they 
look, among those pairs, for those whose semantic 
vectors (computed using latent semantic analysis) 
are significantly correlated. In short, the idea behind 
the semantic component of their model is that words 
that tend to co-occur with the same set of words, 
within a certain window of text, are likely to be se- 
mantically correlated words. 

While we follow Schone and Jurafsky's idea of 
combining orthographic and semantic cues, our al- 
gorithm differs from theirs in both respects. From the 
point of view of orthography, we rely on the com- 
parison between individual word pairs, without re- 
quiring that the two words share a frequent affix, and 
indeed without requiring that they share an affix at 
all. 

From the point of view of semantics, we compute 
scores based on mutual information instead of latent 
semantic analysis. Thus, we only look at the co- 
occurrence patterns of target words, rather than at 
the similarity of their contexts. 

Future research should try to assess to what extent 
these two approaches produce significantly different 
results, and/or to what extent they are complemen- 
tary. 

2.3 Yarowsky and Wicentowski (2000) 

Yarowsky and Wicentowski (2000) propose an algo- 
rithm that extracts morphological rules relating roots 
and inflected forms of verbs (but the algorithm can 
be extended to other morphological relations). 

Their algorithm performs unsupervised, but not 
completely knowledge-free, learning. It requires 
a table of canonical suffixes for the relevant parts 
of speech of the target language, a list of the con- 
tent word roots with their POS (and some informa- 
tion about the possible POS/inflectional features of 
other words), a list of the consonants and vowels of 
the language, information about some characteristic 
syntactic patterns and, if available, a list of function 
words. 

The algorithm uses a combination of different 
probabilistic models to find pairs that are likely to be 
morphologically related. One model matches root 
+ inflected form pairs that have a similar frequency 
profile. Another model matches root + inflected 
form pairs that tend to co-occur with the same sub- 
jects and objects (identified using simple regular ex- 
pressions). Yet another model looks for words that 
are orthographically similar, in terms of a minimum 
edit distance score that penalizes consonant changes 
more than vowel changes. Finally, the rules relating 
stems and inflected forms that the algorithm extracts 
from the pairs it finds in an iteration are used as a 
fourth probabilistic model in the subsequent itera- 
tions. 

Yarowsky and Wicentowski show that the al- 
gorithm is extremely accurate in identifying En- 
glish root + past tense form pairs, including those 
pairs that are related by non-affixal patterns (e.g., 
think/thought.) 

The main issue with this model is, of course, that 
it cannot be applied to a new target language with- 
out having some a priori knowledge about some of 
its linguistic properties. Thus, the algorithm can- 
not be applied in cases in which the grammar of 
the target language has not been properly described 
yet, or when the relevant information is not available 
for other reasons. Moreover, even when such infor- 
mation is in principle available, trying to determine 
to what extent morphology could be learned with- 
out relying on any other knowledge source remains 
an interesting theoretical pursuit, and one whose an- 
swer could shed some light on the problem of human 
language acquisition. 

3 The current approach: Morphological 
relatedness as a function of orthographic 
and semantic similarity 

The basic intuition behind the model presented here 
is extremely simple: Morphologically related words 
tend to be both orthographically and semantically 
similar. Obviously, there are many words that are or- 
thographically similar, but are not morphologically 
related; for example, blue and glue. At the same 
time, many semantically related words are not mor- 
phologically related (for example, blue and green). 
However, if two words have a similar shape and a 
related meaning (e.g., green and greenish), they are 
very likely to be also morphologically related. 

In order to make this idea concrete, we use min- 
imum edit distance to identify words that are ortho- 
graphically similar, and mutual information between 
words to identify semantically related words. 

3.1 Outline of the procedure 

Given an unannotated input corpus, the algorithm 
(after some elementary tokenization) extracts a list 
of candidate content words. This is simply a list 
of all the alphabetic space- or punctuation-delimited 
strings in the corpus that have a corpus frequency 
below .01% of the total token count. 2 

Preliminary experiments indicated that our proce- 
dure does not perform as well without this trimming. 
Notice in any case that function words tend to be of 
little morphological interest, as they display highly 
lexicalized, often suppletive morphological patterns. 
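
The high-frequency trimming step can be illustrated with a minimal Python sketch. The function name, the regular expression used for tokenization, and the decision to keep case distinctions are our assumptions for illustration; the text above only fixes the .01% threshold.

```python
import re
from collections import Counter

def candidate_content_words(text, max_rel_freq=0.0001):
    """Return the word types whose corpus frequency is below
    max_rel_freq (.01% by default) of the total token count."""
    # Assumption: a token is a maximal run of alphabetic characters,
    # so spaces, digits and punctuation all act as delimiters.
    tokens = re.findall(r"[^\W\d_]+", text)
    counts = Counter(tokens)
    threshold = max_rel_freq * len(tokens)
    # Keep only types that are rarer than the threshold.
    return {word for word, freq in counts.items() if freq < threshold}
```

On a realistic corpus, the default threshold mainly removes function words; on very small inputs a larger `max_rel_freq` is needed for the filter to keep anything.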

The word list extracted as described above and the 
input corpus are used to compute two lists of word 
pairs: An orthographic similarity list, in which the 

2 In future versions of the algorithm, we plan to make this 
high frequency threshold dependent on the size of the input cor- 
pus. 



pairs are scored on the basis of their minimum edit 
distance, and a semantic similarity list, based on mu- 
tual information. Because of minimum thresholds 
that are enforced during the computation of the two 
measures, neither list contains all the pairs that can 
in principle be constructed from the input list. 

Before computing the combined score, we get rid 
of the pairs that do not occur in both lists (the ra- 
tionale being that we do not want to guess the mor- 
phological status of a pair on the sole basis of ortho- 
graphic or semantic evidence). 

We then compute a weighted sum of the ortho- 
graphic and semantic similarity scores of each re- 
maining pair. In the experiments reported below, the 
weights are chosen so that the maximum weighted 
scores for the two measures are in the same order 
of magnitude (we prefer to align maxima rather than 
means because both lists are trimmed at the bottom, 
making means and other measures of central ten- 
dency less meaningful). 

The pairs are finally ranked on the basis of the 
resulting combined scores. 
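
The combination step can be sketched as follows. This is only an illustration under stated assumptions: the two lists are represented as dictionaries keyed by the same pair tuples, the semantic weight is chosen so that the two maxima coincide exactly (the text only requires the same order of magnitude), and the maximum semantic score is assumed to be positive. The name `combine_and_rank` is hypothetical.

```python
def combine_and_rank(orth, sem, w_orth=1.0, w_sem=None):
    """Rank pairs that occur in both score dictionaries by the
    weighted sum of their orthographic and semantic scores."""
    # Discard pairs supported by only one kind of evidence.
    shared = orth.keys() & sem.keys()
    if not shared:
        return []
    if w_sem is None:
        # Assumption: align the maxima of the two (full) lists exactly;
        # requires max(sem.values()) > 0.
        w_sem = w_orth * max(orth.values()) / max(sem.values())
    scored = [(pair, w_orth * orth[pair] + w_sem * sem[pair]) for pair in shared]
    # Highest combined score first.
    return sorted(scored, key=lambda item: item[1], reverse=True)
```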

In the next subsections, we describe how the or- 
thographic and semantic similarity lists are con- 
structed, and some properties of the measures we 
adopted. 

3.2 Scoring the orthographic similarity of 
word pairs 

Like Yarowsky and Wicentowski, we use mini- 
mum edit distance to measure orthographic simi- 
larity. The minimum edit distance between two 
strings is the minimum number of editing oper- 
ations (insertion, deletion, substitution) needed to 
transform one string into the other (see section 5.6 
of Jurafsky and Martin (2000) and the references 
quoted there). 

Unlike Yarowsky and Wicentowski, we do not at- 
tempt to define a phonologically sensible edit dis- 
tance scoring function, as this would require making 
assumptions about how the phonology of the target 
language maps onto its orthography, thus falling out- 
side the domain of knowledge-free induction. In- 
stead, we assign a cost of 1 to all editing operations, 
independently of the nature of the source and target 
segments. Thus, in our system, the pairs dog/Dog, 
man/men, bat/mat and day/dry are all assigned a 
minimum edit distance of 1. 

Rather than computing absolute minimum edit 
distance, we normalize this measure by dividing 
it by the length of the longest string (this corre- 
sponds to the intuition that, say, two substitutions 
are less significant if we are comparing two eight- 
letter words than if we are comparing two three- 
letter words). Moreover, since we want to rank pairs 
on the basis of orthographic similarity, rather than 
dissimilarity, we compute (1 - normalized minimum 
edit distance), obtaining a measure that ranges from 
1 for identical forms to 0 for forms that do not share 
any character. 

This measure is computed for all pairs of words in 
the potential content word list. However, for reasons 
of size, only pairs that have a score of .5 or higher 
(i.e., where the two members share at least half of 
their characters) are recorded in the output list. 

Notice that orthographic similarity does not favor 
concatenative affixal morphology over other types 
of morphological processes. For example, the pairs 
woman/women and park/parks both have an ortho- 
graphic similarity score of .8. 

Moreover, orthographic similarity depends only 
on the two words being compared, and not on global 
distributional properties of these words and their 
substrings. Thus, words related by a rare morpho- 
logical pattern can have the same score as words 
related by a very frequent pattern, as long as the 
minimum edit distance is the same. For example, 
both nucleus/nuclei and bench/benches have an or- 
thographic similarity score of .714, despite the fact 
that the latter pair reflects a much more common plu- 
ralization pattern. 
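
The scores quoted above can be reproduced with a short Python sketch of the measure, using the standard dynamic-programming computation of minimum edit distance with unit costs. The function names are ours.

```python
def min_edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions
    (each with cost 1) needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]

def orthographic_similarity(a, b):
    """1 - (minimum edit distance / length of the longest string)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1 - min_edit_distance(a, b) / longest

print(round(orthographic_similarity("woman", "women"), 3))    # 0.8
print(round(orthographic_similarity("nucleus", "nuclei"), 3))  # 0.714
```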

Of course, this emancipation from edge-anchored 
concatenation and global distributional salience also 
implies that orthographic similarity will assign high 

3 Following a suggestion by two reviewers, we are currently 
experimenting with an iterative version of our algorithm, along 
the lines of the one described by Yarowsky and Wicentowski. 
We start with the cost matrix described in the text, but we re- 
estimate the editing costs on the basis of the empirical character- 
to-character (or character-to-zero/zero-to-character) probabili- 
ties observed in the output of the previous run of the algorithm. 
Surprisingly, the revised version of the algorithm leads to (mod- 
erately) worse results than the single-run version described in 
this paper. Further experimentation with edit cost re-estimation 
is needed, in order to understand which aspects of our iterative 
procedure make it worse than the single-run model, and how it 
could be improved. 



scores to many pairs that are not morphologically 
related - for example, the pair friends/trends also has 
an orthographic similarity score of .714. 

Furthermore, since in most languages the range of 
possible word lengths is narrow, orthographic simi- 
larity as a ranking measure tends to suffer from a "mas- 
sive tying" problem. For example, when pairs from 
the German corpus described below are ranked on 
the sole basis of orthographic similarity, the result- 
ing list is headed by a block of 19,597 pairs that all 
have the same score. These are all pairs where one 
word has 9 characters, the other 9 or 8 characters, 
and the two differ in only one character. 4 

For the above reasons, it is crucial that ortho- 
graphic similarity is combined with an independent 
measure that allows us to distinguish between simi- 
larity due to morphological relatedness vs. similar- 
ity due to chance or other reasons. 

3.3 Scoring the semantic similarity of word 
pairs 

Measuring the semantic similarity of words on the 
basis of raw corpus data is obviously a much harder 
task than measuring the orthographic similarity of 
words. 

Mutual information (first introduced to compu- 
tational linguistics by Church and Hanks (1989)) is 
one of many measures that seem to be roughly 
correlated with the degree of semantic relatedness be- 
tween words. The mutual information between two 
words A and B is given by: 



$$ I(A, B) = \log \frac{\Pr(A, B)}{\Pr(A)\,\Pr(B)} \tag{1} $$



Intuitively, the larger the deviation between the 
empirical frequency of co-occurrence of two words 
and the expected frequency of co-occurrence if they 
were independent, the more likely it is that the oc- 
currence of one of the two words is not independent 
from the occurrence of the other. 

Brown et alii (1990) observed that when mutual 
information is computed in a bi-directional fashion, 
and by counting co-occurrences of words within a 



4 Most of the pairs in this block - 78% - are actually morpho- 
logically related. However, given that all pairs contain words 
of length 9 and 8/9 that differ in one character only, they are 
bound to reflect only a very small subset of the morphological 
processes present in German. 



relatively large window, but excluding "close" co- 
occurrences (which would tend to capture colloca- 
tions and lexicalized phrases), the measure identifies 
semantically related pairs. 

It is particularly interesting for our purposes that 
most of the examples of English word clusters con- 
structed on the basis of this interpretation of mutual 
information by Brown and colleagues (reported in 
their table 6) include morphologically related words. 
A similar pattern emerges among the examples of 
German words clustered in a similar manner by 
Baroni et alii (2002). Rosenfeld (1996) reports that 
morphologically related pairs are common among 
words with a high (average) mutual information. 

We computed mutual information by considering, 
for each pair, only co-occurrences within a maxi- 
mal window of 500 words and outside a minimal 
window of 3 words. Given that mutual informa- 
tion is notoriously unreliable at low frequencies (see, 
for example, Manning and Schütze (1999), section 
5.4), we only collected mutual information scores 
for pairs that co-occurred at least three times (within 
the relevant window) in the input corpus. Obvi- 
ously, occurrences across article boundaries were 
not counted. Notice however that the version of the 
Brown corpus we used does not mark article bound- 
aries. Thus, in this case the whole corpus was treated 
as a single article. 
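
The windowed computation can be sketched in Python as follows. Several details are our assumptions rather than part of the description above: the minimal window is read as excluding distances of up to 3 tokens, probabilities are estimated as counts divided by the number of tokens, the natural logarithm is used, and the input is a single article given as a token list. The function and parameter names are hypothetical.

```python
import math
from collections import Counter

def mutual_information_scores(tokens, candidates,
                              min_dist=3, max_dist=500, min_cooc=3):
    """Mutual information for unordered candidate pairs that co-occur
    at a distance d with min_dist < d <= max_dist at least min_cooc times."""
    n = len(tokens)
    unigram = Counter(t for t in tokens if t in candidates)
    cooc = Counter()
    for i, a in enumerate(tokens):
        if a not in candidates:
            continue
        # Scanning forward only and storing unordered pairs makes the
        # count bi-directional without counting any occurrence twice.
        for j in range(i + min_dist + 1, min(i + max_dist + 1, n)):
            b = tokens[j]
            if b in candidates and b != a:
                cooc[tuple(sorted((a, b)))] += 1
    scores = {}
    for (a, b), c in cooc.items():
        if c >= min_cooc:  # mutual information is unreliable at low counts
            scores[(a, b)] = math.log((c / n) /
                                      ((unigram[a] / n) * (unigram[b] / n)))
    return scores
```

For a corpus with marked article boundaries, the function would be applied to each article separately and the counts pooled before scoring; that bookkeeping is omitted here.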

Our "semantic" similarity measure is based on the 
notion that related words will often occur near 
each other. This differs from the (more 
general) approach of Schone and Jurafsky (2000), 
who look for words that tend to occur in the same 
context. It remains an open question whether the 
two approaches produce complementary or redun- 
dant results. 5 

Taken by itself, mutual information is a worse 
predictor of morphological relatedness than mini- 
mum edit distance. For example, among the top one 
hundred pairs ranked by mutual information in each 
language, only one German pair and five English 
pairs are morphologically motivated. This poor per- 
formance is not too surprising, given that there are 



5 We are currently experimenting with a measure based on 
semantic context similarity (determined on the basis of class- 
based left-to-right and right-to-left bigrams), but the current im- 
plementation of this requires ad hoc corpus-specific settings to 
produce interesting results with both our test corpora. 



plenty of words that often co-occur without 
being morphologically related. Consider for exam- 
ple (from our English list) the pairs index/operand 
and orthodontist/teeth. 

4 Empirical evaluation 
4.1 Materials 

We tested our procedure on the German APA corpus, 
a corpus of newswire containing over twenty-eight 
million word tokens, and on the English Brown cor- 
pus (Kucera and Francis, 1967), a balanced corpus 
containing less than one million two hundred thou- 
sand word tokens. Of course, the most important dif- 
ference between these two corpora is that they rep- 
resent different languages. However, observe also 
that they have very different sizes, and that they are 
different in terms of the types of texts constituting 
them. 

Besides the high frequency trimming procedure 
described above, for both languages we removed 
from the potential content word lists those words 
that were not recognized by the XEROX morpholog- 
ical analyzer for the relevant language. The reason 
for this is that, as we describe below, we use this tool 
to build the reference sets for evaluation purposes. 
Thus, morphologically related pairs composed of 
words not recognized by the analyzer would unfairly 
lower the precision of our algorithm. 

Moreover, after some preliminary experimenta- 
tion, we also decided to remove words longer than 9 
characters from the German list (this corresponds to 
trimming words whose length is one standard devi- 
ation or more above the average token length). This 
actually lowers the performance of our system, but 
makes the results easier to analyze - otherwise, the 
top of the German list would be cluttered by a high 
number of rather uninteresting morphological pairs 
formed by inflected forms from the paradigm of 
very long nominal compounds (such as Wirtschafts- 
forschungsinstitut 'institute for economic research'). 

Unlike high frequency trimming, the two opera- 
tions we just described are meant to facilitate empir- 
ical evaluation, and they do not constitute necessary 
steps of the core algorithm. 



4.2 Precision 

In order to evaluate the precision obtained by our 
procedure, we constructed a list of all the pairs that, 
according to the analysis provided by the XEROX 
analyzer for the relevant language, are morpholog- 
ically related (i.e., share one of their stems). 6 We 
refer to the lists constructed in the way we just de- 
scribed as reference sets. 

The XEROX tools we used do not provide deriva- 
tional analysis for English, and a limited form of 
derivational analysis for German. Our algorithm, 
however, finds both inflectionally and derivationally 
related pairs. Thus, basing our evaluation on a com- 
parison with the XEROX parses leads to an underes- 
timation of the precision of the algorithm. We found 
that this problem is particularly evident in English, 
since English, unlike German, has a rather poor in- 
flectional morphology, and thus the discrepancies 
between our output and the analyzer parses in terms 
of derivational morphology have a more visible im- 
pact on the results of the comparison. For example, 
the English analyzer does not treat pairs related by 
the adverbial suffix -ly or by the prefix un- as mor- 
phologically related, whereas our algorithm found 
pairs such as soft/softly and load/unload. 

In order to obtain a fairer assessment of the 
algorithm, we went manually through the first 2,000 
English pairs found by our algorithm but not parsed 
as related by the analyzer, looking for items to be 
added to the reference set. We were extremely 
conservative, and we added to the reference set 
only those pairs that are related by a transparent 
and synchronically productive morphological pat- 
tern. When in doubt, we did not correct the analyzer- 
based analysis. Thus, for example, we did not count 
pairs such as machine/machinery, variables/varies 
or electric/electronic as related. 

We did not perform any manual post-processing 
on the German reference set. 

Tables 1 and 2 report percentage precision (i.e., 
the percentage of pairs that are in the reference set 
over the total number of ranked pairs up to the rele- 
vant threshold) at various cutoff points, for German 
and English respectively. 
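
The precision figures can be computed as in the following minimal sketch, which assumes the ranked output is a list of word pairs and the reference set is a set of unordered pairs (represented here as `frozenset` objects). The function name is ours.

```python
def precision_at_cutoffs(ranked_pairs, reference_set, cutoffs):
    """Percentage of the top-k ranked pairs found in the reference set,
    for each cutoff k."""
    result = {}
    for k in cutoffs:
        top = ranked_pairs[:k]
        if not top:
            continue  # nothing to evaluate at this cutoff
        hits = sum(1 for pair in top if frozenset(pair) in reference_set)
        result[k] = 100.0 * hits / len(top)
    return result
```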

6 The XEROX morphological analyzers are state-of-the-art, 
knowledge-driven morphological analysis tools (see for exam- 
ple Karttunen et alii (1997)). 



# of pairs    precision
500           97%
1000          96%
1500          96%
2000          94%
3000          81%
4000          65%
5000          53%
5279          50%

Table 1: German precision at various cutoff points 
(5279 = total number of pairs) 



# of pairs    precision
500           98%
1000          95%
1500          91%
2000          83%
3000          72%
4000          58%
5000          48%
8902          29%

Table 2: English precision at various cutoff points 
(8902 = total number of pairs) 

For both languages we notice a remarkably high 
precision rate (> 90%) up to the 1500-pair cutoff 
point. 

After that, there is a sharper drop in the English 
precision, whereas the decline in German is more 
gradual. This is perhaps due in part to the problems 
with the English reference set we discussed above, 
but notice also that English has an overall poorer 
morphological system and that the English corpus is 
considerably smaller than the German one. Indeed, 
our reference set for German contains more than ten 
times the forms in the English reference set. 

Notice anyway that, for both languages, the preci- 
sion rate is still around 50% at the 5000-pair cutoff. 7 

7 Yarowsky and Wicentowski (2000) report an accuracy of 
over 99% for their best model and a test set of 3888 pairs. Our 
precision rate at a comparable cutoff point is much lower (58% 
at the 4000-pair cutoff). However, Yarowsky and Wicentowski 
restricted the possible matchings to pairs in which one member 
is an inflected verb form, and the other member is a potential 
verbal root, whereas in our experiments any word in the corpus 
(as long as it was below a certain frequency threshold, and it was 
recognized by the XEROX analyzer) could be matched with any 
other word in the corpus. Thus, on the one hand, Yarowsky and 
Wicentowski forced the algorithm to produce a matching for a 
certain set of words (their set of inflected forms), whereas our 
algorithm was not subject to an analogous constraint. On the 
other hand, though, our algorithm had to explore a much larger 
possible matching space, and it could (and did) make a high 
number of mistakes on pairs (such as, e.g., sorry and worry) that 



Of course, what counts as a "good" precision rate 
depends on what we want to do with the output of 
our procedure. We show below that even a very 
naive morphological rule extraction algorithm can 
extract sensible rules by taking whole output lists as 
its input, since, although the number of false pos- 
itives is high, they are mostly related by patterns 
that are not attested as frequently in the list as the 
patterns relating true morphological pairs. In other 
words, true morphological pairs tend to be related 
by patterns that are distributionally more robust than 
those displayed by false positives. Thus, rule ex- 
tractors and other procedures processing the output 
of our algorithm can probably tolerate a high false 
positive rate if they take frequency and other distri- 
butional properties of patterns into account. 

Notice that we discussed only precision, and not 
recall. This is because we believe that the goal of a 
morphological discovery procedure is not to find the 
exhaustive list of all morphologically related forms 
in a language (indeed, because of morphological 
productivity, such a list is infinite), but rather to dis- 
cover all the possible (synchronically active and/or 
common) morphological processes present in a lan- 
guage. It is much harder to measure how well our 
algorithm performed in this respect, but the qualita- 
tive analysis we present in the next subsection indi- 
cates that, at least, the algorithm discovers a varied 
and interesting set of morphological processes. 

4.3 Morphological patterns discovered by the 
algorithm 

The precision tables confirm that the algorithm 
found a good number of morphologically related 
pairs. However, if it turned out that all of these 
pairs were examples of the same morphological pat- 
tern (say, nominal plural formation in -s), the al- 
gorithm would not be of much use. Moreover, we 
stated at the beginning that, since our algorithm does 
not assume an edge-based stem+affix concatenation 
model of morphology, it should be well suited to dis- 
cover relations that cannot be characterized in these 

Yarowsky and Wicentowski's algorithm did not have to con- 
sider. Schone and Jurafsky (2000) report a maximum precision 
of 92%. It is hard to compare this with our results, since they 
use a more sophisticated scoring method (based on paradigms 
rather than pairs) and a different type of gold standard. More- 
over, they do not specify the size of the input they 
used for evaluation. 



terms (e.g., pairs related by circumfixation, stem 
changes, etc.). It is interesting to check whether the 
algorithm was indeed able to find relations of this 
sort. 

Thus, we performed a qualitative analysis of the 
output of the algorithm, trying to understand what 
kind of morphological processes were captured by 
it. 

In order to look for morphological processes in 
the algorithm output, we wrote a program that ex- 
tracts "correspondence rules" in the following sim- 
ple way: For each pair, the program looks for the 
longest shared (case-insensitive) left- and right-edge 
substrings (i.e., for a stem + suffix parse and for a 
prefix + stem parse). The program then chooses the 
parse with the longest stem (assuming that one of the 
two parses has a non-zero stem), and extracts the rel- 
evant edge-bound correspondence rule. If there is a 
tie, the stem + suffix parse is preferred. The program 
then ranks the correspondence rules on the basis of 
their frequency of occurrence in the original output 
list. 8 
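
The rule extraction procedure just described can be sketched in Python as follows. This is an illustrative reconstruction, not the program used in the experiments: lowercasing the whole words and sorting the two sides of each rule into a canonical order are our simplifications, and the function names are hypothetical. The empty string stands for the null affix.

```python
from collections import Counter

def shared_prefix_len(a, b):
    """Length of the longest common left-edge substring of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def correspondence_rule(w1, w2):
    """Return ('suffix' | 'prefix', x, y) for the parse with the longest
    stem, or None if the two words share neither edge substring."""
    a, b = w1.lower(), w2.lower()  # case-insensitive comparison
    left = shared_prefix_len(a, b)               # stem + suffix parse
    right = shared_prefix_len(a[::-1], b[::-1])  # prefix + stem parse
    if left == 0 and right == 0:
        return None
    if left >= right:  # ties go to the stem + suffix parse
        sides = sorted((a[left:], b[left:]))
        return ("suffix", sides[0], sides[1])
    sides = sorted((a[:len(a) - right], b[:len(b) - right]))
    return ("prefix", sides[0], sides[1])

def rank_rules(pairs):
    """Rank correspondence rules by their frequency in the pair list."""
    rules = (correspondence_rule(w1, w2) for w1, w2 in pairs)
    return Counter(r for r in rules if r is not None).most_common()
```

Note that a pair such as man/men receives the suffixation rule an/en under this procedure, which illustrates why edge-bound parses cannot generalize over stem changes.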

We want to stress that we are adopting this proce- 
dure as a method to explore the results, and we are 
by no means proposing it as a serious rule induction 
algorithm. One of the most obvious drawbacks of 
the current rule extraction procedure is that it is only 
able to extract linear, concatenative, edge-bound suf- 
fixation and prefixation patterns, and thus it misses 
or fails to correctly generalize some of the most in- 
teresting patterns in the output. Indeed, looking at 
the patterns missed by the algorithm (as we do in 
part below) is as instructive as looking at the rules it 
found. 

Tables 3 and 4 report the top five suffixation and 
prefixation patterns found by the rule extractor by 
taking the entire German and English output lists as 
its input. 

These tables show that our morphological pair
scoring procedure found many instances of various
common morphological patterns. With the exception
of the German "prefixation" rule ers↔drit (actually
relating the roots of the ordinals 'first' and
'third'), and of the compounding pattern ε↔Öl
('oil'), all the rules in these lists correspond to
realistic affixation patterns. Not surprisingly, in both

8 Ranking by cumulative score yields analogous results.



rule       example                  fq

ε↔s        Jelzin↔Jelzins           921
ε↔n        lautete↔lauteten         670
ε↔en       digital↔digitalen        225
ε↔e        rot↔rote                 201
ε↔es       Papst↔Papstes            113

ε↔ge       stiegen↔gestiegen        9
ε↔Öl       Embargo↔Ölembargo        6
ε↔vor      Mittag↔Vormittag         5
aus↔ein    ausfuhren↔einfuhren      4
ers↔drit   Erstens↔Drittens         4



Table 3: The most common German suffixation and 
prefixation patterns 



rule       example                        fq

ε↔s        allotment↔allotments           860
ε↔ed       accomplish↔accomplished        98
ed↔ing     established↔establishing       87
ε↔ing      experiment↔experimenting       85
ε↔d        conjugate↔conjugated           58

ε↔un       structured↔unstructured        17
ε↔re       organization↔reorganization    12
ε↔in       organic↔inorganic              7
ε↔non      specifically↔nonspecifically   6
ε↔dis      satisfied↔dissatisfied         5



Table 4: The most common English suffixation and 
prefixation patterns 

languages many of the most frequent rules (such as
ε↔s) are poly-functional, corresponding to a
number of different morphological relations within
and across categories.

The results reported in these tables confirm that 
the algorithm is capturing common affixation pro- 
cesses, but they are based on patterns that are so 
frequent that even a very naive procedure could uncover
them. 9

More interesting observations emerge from fur- 
ther inspection of the ranked rule files. For exam- 
ple, among the 70 most frequent German suffixation 
rules extracted by the procedure, we encounter those 
in table 5. 10 

The patterns in this table show that our algorithm 
is capturing the non-concatenative plural formation 

9 For example, as shown by a reviewer, a procedure that pairs
words that share the same first five letters, and extracts the
diverging substrings following the common prefix from each pair.

10 In order to find the set of rules presented in table 5 using the 
naive algorithm described in the previous footnote, we would 
have to consider the 2672 most frequent rules. Most of these 
2672 rules, of course, do not correspond to true morphological 
patterns - thus, the interesting rules would be buried in noise. 
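The naive baseline mentioned in footnote 9 might look roughly like the
following Python sketch. The footnote gives only a one-sentence
description, so the details here are assumptions: words shorter than five
letters are skipped, "the common prefix" is taken to be the full longest
common prefix of the pair, and the function name naive_rules is
hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
import os


def naive_rules(words, k=5):
    """Pair words sharing their first k letters and count diverging endings."""
    groups = defaultdict(list)
    for w in sorted(set(words)):
        if len(w) >= k:                 # assumption: shorter words are ignored
            groups[w[:k]].append(w)
    rules = Counter()
    for group in groups.values():
        for w1, w2 in combinations(group, 2):
            # os.path.commonprefix compares character by character
            n = len(os.path.commonprefix([w1, w2]))
            rules[(w1[n:] or "ε") + "↔" + (w2[n:] or "ε")] += 1
    return rules
```

Because every pair of words with the same first five letters is counted,
such a procedure produces a rule for unrelated words as well, which is
consistent with the remark that the interesting rules end up buried in
noise.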



rule       example                  fq

ag↔äge     Anschlag↔Anschläge       10
ang↔änge   Rückgang↔Rückgänge       6
all↔älle   Überfall↔Überfälle       6
ug↔üge     Tiefflug↔Tiefflüge       5
and↔ände   Vorstand↔Vorstände       5
uch↔üche   Einbruch↔Einbrüche       3
auf↔äufe   Verkauf↔Verkäufe         3
ag↔ägen    Vertrag↔Verträgen        3



Table 5: Some German rules involving stem vowel 
changes found by the rule extractor 

process involving fronting of the stem vowel plus 
addition of a suffix (-e/-en). A smarter rule extractor 
should be able to generalize from patterns like these 
to a smaller number of more general rules capturing 
the discontinuous change. Other umlaut-based patterns
that do not involve concomitant suffixation,
such as Mutter/Mütter, were also found by our
core algorithm, but they were wrongly parsed as
involving prefixes (e.g., Mu↔Mü) by the rule extractor.

Finally, it is very interesting to look at those pairs 
that are morphologically related according to the 
XEROX analyzer, and that were discovered by our 
algorithm, but where the rule extractor could not 
posit a rule, since they do not share a substring at 
either edge. These are listed, for German, in table 6. 



Table 6: Morphologically related German pairs that 
do not share an edge found by the basic algorithm 

We notice in this table, besides three further in- 
stances of non-affixal morphology, a majority of 
pairs involving circumfixation of one of the mem- 
bers. 

While a more in-depth qualitative analysis of our 
results should be conducted, the examples we dis- 
cussed here confirm that our algorithm is able to cap- 
ture a number of different morphological patterns, 
including some that do not fit into a strictly concatenative
edge-bound stem+affix model.

5 Conclusion and Future Directions 

We presented an algorithm that, by taking a raw corpus
as its input, produces a ranked list of morphologically
related pairs as its output. The algorithm
finds morphologically related pairs by looking at the 
degree of orthographic similarity (measured by min- 
imum edit distance) and semantic similarity (mea- 
sured by mutual information) between words from 
the input corpus. 
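As a reminder of the two measures involved, the following Python sketch
shows a standard unit-cost minimum edit distance and a pointwise mutual
information score computed from co-occurrence counts. Both are textbook
formulations given under assumptions: the edit costs, the way
co-occurrence counts are collected, and the way the two scores are
combined into a ranking are not restated in this section and are not
reproduced here.

```python
import math


def edit_distance(a, b):
    """Minimum number of unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def pointwise_mi(n_ab, n_a, n_b, n_total):
    """log2 of P(a, b) / (P(a) P(b)), estimated from raw counts."""
    return math.log2((n_ab * n_total) / (n_a * n_b))
```

With unit costs, pencil/pencils are one edit apart and
structured/unstructured two; a positive mutual information value
indicates that two words co-occur more often than expected by chance.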

Experiments with German and English inputs 
gave encouraging results, both in terms of precision, 
and in terms of the nature of the morphological pat- 
terns found within the output set. 

In work in progress, we are exploring various pos- 
sible improvements to our basic algorithm, includ- 
ing iterative re-estimation of edit costs, addition of a 
context-similarity-based measure, and extension of 
the output set by morphological transitivity, i.e. the 
idea that if word a is related to word b, and word b 
is related to word c, then word a and word c should 
also form a morphological pair. 
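One possible reading of the transitivity idea is sketched below in Python.
It groups words into clusters with a union-find structure and proposes
every pair inside a cluster that is not already in the output. This is an
illustration under assumptions, not the extension under development:
taking the full transitive closure (rather than a single step) and the
function name transitive_pairs are choices made for the example.

```python
from itertools import combinations


def transitive_pairs(pairs):
    """Return new pairs implied by transitivity over the given pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)          # merge the two clusters

    clusters = {}
    for w in list(parent):
        clusters.setdefault(find(w), []).append(w)

    existing = {frozenset(p) for p in pairs}
    new = set()
    for members in clusters.values():
        for a, b in combinations(sorted(members), 2):
            if frozenset((a, b)) not in existing:
                new.add((a, b))
    return new
```

For instance, given nannte/genannt and genannt/genannten, the sketch
proposes genannten/nannte as an additional pair. A practical version would
probably need to limit the closure, since a single wrong pair can merge
two unrelated clusters.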

Moreover, we plan to explore ways to relax the re- 
quirement that all pairs must have a certain degree of 
semantic similarity to be treated as morphologically 
related (there is evidence that humans treat certain 
kinds of semantically opaque forms as morpholog- 
ically complex - see Baroni (2000) and the refer- 
ences quoted there). This will probably involve tak- 
ing distributional properties of word substrings into 
account. 

From the point of view of the evaluation of the
algorithm, we should design an assessment scheme that
would make our experimental results more directly
comparable to those of Yarowsky and Wicentowski (2000),
Schone and Jurafsky (2000) and others. Moreover,
a more in-depth qualitative analysis of the results
should concentrate on identifying specific classes of
morphological processes that our algorithm can or
cannot identify correctly.

We envisage a number of possible uses for the 
ranked list that constitutes the output of our model. 
First, the model could provide the input for a 
more sophisticated rule extractor, along the lines of 
those proposed by Albright and Hayes (1999) and
Neuvel (2002). Such models extract morphological
generalizations in terms of correspondence patterns
between whole words, rather than in terms of
affixation rules, and are thus well suited to identify
patterns involving non-concatenative morphology
and/or morphophonological changes. A list
of related words constitutes a more suitable input
for them than a list of words segmented into
morphemes.

Pairs listed in Table 6:

Alter       älteren
Arzt        Ärzte
Arztes      Ärzte
Fesseln     gefesselt
Folter      gefoltert
Putsch      geputscht
Spende      gespendet
Spenden     gespendet
Streik      gestreikt
fordern     gefordert
forderten   gefordert
fördern     gefördert
genannt     nannte
genannten   nannte
geprallt    prallte
gesetzt     setzte
gestürzt    stürzte

Rules extracted in this way would have a number 
of practical uses - for example, they could be used 
to construct stemmers for information retrieval ap- 
plications, or they could be integrated into morpho- 
logical analyzers. 

Our procedure could also be used to re- 
place the first step of algorithms, such as those 
of Goldsmith (2001) and Snover and Brent (2001), 
where heuristic methods are employed to generate 
morphological hypotheses, and then an information- 
theoretically/probabilistically motivated measure is 
used to evaluate or improve such hypotheses. More
generally, our algorithm can help reduce the size
of the search space that all morphological discovery 
procedures must explore. 

Last but not least, the ranked output of (an im- 
proved version of) our algorithm can be of use to 
the linguist analyzing the morphology of a language, 
who can treat it as a way to pre-process her/his 
data, while still relying on her/his analytical skills 
to extract the relevant morphological generalizations 
from the ranked pairs. 